We will compare some classifiers on the “Toxic” column.

Load libraries

library(tidyverse)
library(tictoc)
library(caret)
library(class)
source("./parameters.R")

Set the value of K for the run

# Number of nearest neighbors taken into account
k = 5

Open the Bag of Words with labels

# We open a relatively small Bag of Words in order to limit calculation time
fileName = "bow_tfidf__min_words_100_2grams_1000__sampling_balanced__cor_cut_0.3_from_1408_to_1110_rm0.csv"
df = read_csv(fileName, col_types=col_types_df)
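# Drop the columns not used in this notebook (positions 2, 3 and 5 to 9);
# what remains is the train/test indicator, the df_toxic label and the features
df = df[,-c(2,3,5:9)]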
df

Remove rows full of zeros

KNN does not cope well with rows that contain only zeros in the bag of words. Such rows are all at distance zero from one another, so each of them has far too many equidistant neighbors. Let’s make sure there are none.

# Go through each row, return TRUE if at least one value is not zero
non_zero_rows = apply(df[,-1], 1, function(row) any(row != 0))
writeLines(paste0("Rows full of zeros: ", sum(!non_zero_rows, na.rm = TRUE)))
Rows full of zeros: 0
# Subset
df = df[non_zero_rows,]
writeLines(paste0("Remaining rows: ", dim(df)[1]))
Remaining rows: 39972
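
One caveat about the check above: `df[,-1]` still appears to contain the label column (`df_toxic`, judging from the split that follows). If that column is a factor or a character vector, `apply()` coerces every row to character, and the comparison with 0 may no longer behave as a numeric test. The sketch below shows a numeric-only variant on a toy data frame. It assumes the first column is the train/test indicator and the second is the label; the column names `split`, `w1` and `w2` are made up for illustration.

```r
# Toy stand-in for the bag of words (all names except df_toxic are hypothetical)
toy <- data.frame(
  split    = c(1, 1, 2, 2),
  df_toxic = factor(c("0", "1", "0", "1")),
  w1       = c(0.0, 0.3, 0.0, 0.0),
  w2       = c(0.0, 0.0, 0.5, 0.0)
)

# Keep only the numeric feature columns (drop the indicator and the label)
features <- as.matrix(toy[, -c(1, 2)])

# A row is kept if at least one feature differs from zero
non_zero_rows <- rowSums(features != 0) > 0

n_zero <- sum(!non_zero_rows)
toy_clean <- toy[non_zero_rows, ]

writeLines(paste0("Rows full of zeros: ", n_zero))
writeLines(paste0("Remaining rows: ", nrow(toy_clean)))
```

Working on a numeric matrix with `rowSums()` avoids the per-row coercion and is usually faster than `apply()` on a wide data frame.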

Splitting the data

# Split between train (indicator 1) and test (indicator 2), dropping the indicator column
df_train = df[df[[1]] == 1, -1]
df_test  = df[df[[1]] == 2, -1]

# Split the train set between features and labels
X_train = df_train[,-1]
Y_train = df_train$df_toxic

# Split the test set between features and labels
X_test = df_test[,-1]
Y_test = df_test$df_toxic

Have a look

X_train
as.data.frame(Y_train) 
X_test
as.data.frame(Y_test)

Train the model

tic("Training: ")

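# knn() has no separate training step: it classifies each row of X_test
# from its k nearest rows in X_train and returns the predicted classes
f <- knn(X_train, X_test, Y_train, k = k)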
Error in knn(X_train, X_test, Y_train, k = k) : too many ties in knn
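
The "too many ties" error is raised by `class::knn()` when a very large number of training rows lie at exactly the same distance from a test row. A likely cause here, though not verified on this data set, is a large group of identical feature rows. A minimal way to check for that is to count duplicated rows with base R's `duplicated()`. The toy matrix below is hypothetical and only illustrates the check.

```r
# Toy feature table with repeated rows (hypothetical data, not the notebook's)
X_toy <- data.frame(
  w1 = c(0.0, 0.0, 0.0, 0.2, 0.2, 0.7),
  w2 = c(0.1, 0.1, 0.1, 0.0, 0.0, 0.3)
)

# duplicated() flags every row that repeats an earlier row
is_dup <- duplicated(X_toy)
n_dup <- sum(is_dup)

# unique() keeps one copy of each distinct row
n_unique <- nrow(unique(X_toy))

writeLines(paste0("Duplicated rows: ", n_dup))
writeLines(paste0("Distinct rows: ", n_unique))
```

Applied to `X_train`, a large duplicate count would point to tied distances as the cause. Possible remedies, untested here, include de-duplicating the training rows or using a bag of words that keeps more features.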

This takes some time…

What’s in f?

What does the confusion matrix give us?
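
Since the `knn()` call failed, there is no prediction to evaluate in this run. As an illustration of what the evaluation step computes, here is a minimal sketch in base R on made-up labels and predictions (the vectors `y_true` and `y_pred` are hypothetical, not results from this notebook). When using `caret::confusionMatrix()` instead, note that its first argument is the predictions and its second is the reference labels.

```r
# Toy labels and predictions (hypothetical values, not results from this notebook)
y_true <- factor(c(0, 0, 1, 1, 1, 0), levels = c(0, 1))
y_pred <- factor(c(0, 1, 1, 1, 0, 0), levels = c(0, 1))

# Rows are the predictions, columns are the reference labels
conf_table <- table(Predicted = y_pred, Actual = y_true)

# Accuracy is the share of observations on the diagonal
accuracy <- sum(diag(conf_table)) / sum(conf_table)

print(conf_table)
writeLines(paste0("Accuracy: ", round(accuracy, 3)))
```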

Results

bow_tfidf__min_words_100_2grams_1000__sampling_balanced__cor_cut_0.001_from_1408_to_32.csv
